Vector quantization has been used in coding applications for several years. 
Recently, quantization of linear predictive coding (LPC) vectors has been used 
for speech coding and recognition. In these latter applications, the only method 
that has been used for deriving the vector quantizer code book from a set of 
training vectors is the one described by Linde, Buzo, and Gray. In this paper, 
we compare this algorithm to several alternative algorithms and also study 
the properties of the resulting code books. Our conclusion is that the various 
algorithms that we tried gave essentially identical code books. 

I. INTRODUCTION 

The technique of vector quantization for LPC voice coding has been 
in use for several years, and has been shown to be of great utility for 
LPC analysis/synthesis systems. 1-4 Recently, vector quantization of 
LPC vectors has been applied to speech-recognition systems both in 
direct applications 5,6 and in conjunction with work on the application 
of hidden Markov models (HMMs) to recognition. 7,8 

The main idea of vector quantization is summarized as follows: 
assume that a training set {T} of I LPC vectors is given. It is desired 
to find a code book of M* LPC vectors such that the average distance 
of a vector in {T} from the closest code book entry is minimized. Thus 
we wish to find a set {R} of reference vectors that minimizes the 
average distance D_I(M*) given by 

$$D_I(M^*) = \min_{\{R\}} \left[ \frac{1}{I} \sum_{i=1}^{I} \min_{1 \le m \le M^*} d(T_i, R_m) \right] \tag{1}$$

where d(T_i, R_m) is the LPC distance between training vector T_i and 
code book entry R_m. 
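The bracketed quantity in eq. (1) can be illustrated with a short sketch. It 
is only an illustration under stated assumptions: squared Euclidean distance 
stands in for the LPC distance, which is not specified at this point, and the 
names sq_euclidean and average_distortion are hypothetical. 

```python
def sq_euclidean(x, y):
    """Stand-in distance (assumption); the paper uses an LPC distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def average_distortion(training, codebook, dist=sq_euclidean):
    """Bracket of eq. (1): mean over the training set of the distance
    from each training vector to its closest code book entry."""
    total = 0.0
    for t in training:
        total += min(dist(t, r) for r in codebook)
    return total / len(training)


# Toy data (illustrative only): four training vectors, two code words.
training = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
codebook = [(0.0, 0.5), (4.0, 0.5)]
print(average_distortion(training, codebook))  # 0.25
```

The outer minimization over {R} in eq. (1) is the search for the code book 
that makes this quantity as small as possible. 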

The optimum code book is generated by a method similar to the K- 
means algorithm. Starting with an initial guess of M* entries, each 
vector of the training set is assigned to the closest entry. The centroids 
of the M* subsets (clusters) obtained in this manner are used as new 
trial entries in the code book, and the iteration is continued until some 
stopping criterion is satisfied. 

For large M*, the choice of initial guesses can be quite important, 
and it is unlikely that a randomly chosen initial guess is a good one. 
For this reason the splitting algorithm was devised in Ref. 1. In this 
algorithm a code book of M = 2 entries is optimized, as described 
above, starting with a random initial guess. Next, each optimum code 
book entry is split into two, and the resulting entries are used as an 
initial guess for a code book of size 2M. This doubling is repeated 
until M = M*. To distinguish this algorithm from others considered 
later, we call it the binary-split algorithm. 

To the best of our knowledge, all speech-related applications of 
vector quantization so far have used this binary-split algorithm. How- 
ever, a priori, the requirement that every code word be split appears 
to be too restrictive. For example, after optimizing an M = 2 code 
book, if one cluster contains almost all the training set and the other 
contains just a few elements, it might be argued that only the larger 
cluster should be split. Thus it is of interest to consider "single-split" 
algorithms in which a single cluster is split at each iteration. 

For very large M* (e.g., 1024 or 2048) single-split algorithms might 
require prohibitive amounts of computation. However, M* on the 
order of 64 or 128 can be quite useful in certain applications. 8 In these 
cases a single-split algorithm is quite feasible. In any case, it is of 
interest to know whether or not a single-split algorithm yields a better 
code book than the binary-split algorithm. 

There are at least three reasonable ways of implementing the 
splitting rule of a single-split algorithm for training the vector quan- 
tizer. To describe these three splitting rules we need some definitions. 
Let 

T_M(m) = The set of training vectors represented by the mth code 
book entry (cluster) in a size M vector quantizer 

C_M(m) = The number of training vectors in T_M(m) 

d_M(m) = The average distance (distortion) of the C_M(m) vectors 
from the mth code-book entry 

D_M(m) = The total distance (distortion) of the C_M(m) vectors. 

We then have the relationships 



$$I = \sum_{m=1}^{M} C_M(m) \tag{2}$$

$$d_M(m) = \frac{1}{C_M(m)} \sum_{q=1}^{C_M(m)} d\bigl(T_M(m)_q, R_m\bigr) \tag{3}$$

$$D_M(m) = C_M(m) \cdot d_M(m). \tag{4}$$

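Using eqs. (2) through (4) we can write the average distortion of eq. 
(1) as 

$$D_I(M) = \min_{\{R\}} \left[ \frac{\sum_{m=1}^{M} D_M(m)}{\sum_{m=1}^{M} C_M(m)} \right] \tag{5a}$$

$$= \min_{\{R\}} \left[ \frac{\sum_{m=1}^{M} d_M(m)\, C_M(m)}{\sum_{m=1}^{M} C_M(m)} \right] \tag{5b}$$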
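As a quick numerical check of the bookkeeping in eqs. (2) through (5), the 
sketch below evaluates the bracketed terms for a fixed assignment of vectors 
to clusters. The per-vector distances are made-up numbers chosen only for 
illustration; they are not data from the experiments. 

```python
# Hypothetical per-vector distances d(T_M(m)_q, R_m) for M = 3 clusters.
cluster_distances = [
    [0.2, 0.4, 0.6],       # cluster 1
    [0.1, 0.3],            # cluster 2
    [0.5, 0.5, 0.5, 0.5],  # cluster 3
]

C = [len(c) for c in cluster_distances]            # C_M(m)
d = [sum(c) / len(c) for c in cluster_distances]   # d_M(m), eq. (3)
D = [Cm * dm for Cm, dm in zip(C, d)]              # D_M(m), eq. (4)

I = sum(C)                                         # eq. (2)
avg_5a = sum(D) / sum(C)                           # bracket of eq. (5a)
avg_5b = sum(dm * Cm for dm, Cm in zip(d, C)) / sum(C)  # bracket of eq. (5b)
direct = sum(sum(c) for c in cluster_distances) / I     # bracket of eq. (1)

print(I, avg_5a, avg_5b, direct)
```

All three averages agree (up to rounding), which is the statement that the 
overall average distortion is the count-weighted mean of the per-cluster 
average distortions. 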



Based on the above definitions, the three splitting rules we have 
considered are: 

Rule 1: Split the cluster, m, with the largest number of vectors, 
C_M(m). We denote the resulting vector quantizer (VQ) 
code-word set as R_c. 
Rule 2: Split the cluster, m, with the largest average distortion, 
d_M(m). We denote the resulting VQ code-word set as R_d. 
Rule 3: Split the cluster, m, with the largest total distortion, 
D_M(m). We denote the resulting VQ code-word set as R_D. 

The key issue is how the different splitting rules affect the properties 
of the resulting vector quantizer, in particular the average 
distortion [eq. (1)] and the coverage of the LPC space. 
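The three splitting rules above can be sketched as a single selection 
function. The function name cluster_to_split, the rule labels, and the 
example numbers are hypothetical, and cluster indices are 0-based here 
rather than 1-based as in the text. 

```python
def cluster_to_split(counts, avg_distortions, rule):
    """Return the index m of the cluster selected for splitting.

    counts[m] is C_M(m) and avg_distortions[m] is d_M(m).
    rule is "count" (Rule 1), "average" (Rule 2), or "total" (Rule 3).
    """
    if rule == "count":
        score = list(counts)
    elif rule == "average":
        score = list(avg_distortions)
    elif rule == "total":
        # D_M(m) = C_M(m) * d_M(m), eq. (4)
        score = [c * d for c, d in zip(counts, avg_distortions)]
    else:
        raise ValueError("unknown rule: " + rule)
    return max(range(len(score)), key=lambda m: score[m])


# Illustrative cluster statistics for M = 3 (made-up numbers).
counts = [900, 60, 400]
avg_distortions = [0.10, 0.90, 0.30]
for rule in ("count", "average", "total"):
    print(rule, cluster_to_split(counts, avg_distortions, rule))
```

With these numbers each rule selects a different cluster, which is the 
situation in which the three single-split algorithms diverge. 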

We have run a series of experimental evaluations of the single-split 
and binary-split algorithms for training the VQ. We have found that 
each of the different splitting criteria leads to a different reference 
prototype set (VQ code book); however, all the VQ sets had essentially 
the same average distortion. We were also able to show that the 
coverage of the LPC space for all VQ sets was identical, and that the 
average distance of any one VQ set from another VQ set was smaller 
than the average distortion of the training set. Hence, the different 
implementations of the training algorithm for the VQ lead to equiva- 
lent VQ reference sets. Thus for any practical application the simple 
binary-split algorithm is effective for deriving the VQ code book 
entries. 

The outline of this paper is as follows. In Section II we review the 
Linde et al. 1 implementation of the binary-split VQ training algorithm 
and show how we modified it to handle the single-split case. In Section 
III we discuss the results of several experiments on testing the different 
implementations of the training algorithm. In Section IV we provide 
a discussion and summary of the results. 

II. IMPLEMENTATION OF THE VQ TRAINING ALGORITHM 

The implementation of the VQ training algorithm is essentially the 
one proposed by Linde et al. 1 A flow diagram of this procedure for the 
binary-split case is given in Fig. 1a and for the single-split case in Fig. 
1b. Given M code words, each vector of the training set T is assigned 
to the code word closest to it. The average distortion D_I(M) is 
computed for this assignment of the I training vectors to M clusters. 
M new code words are obtained as centroids (i.e., averaged normalized 
autocorrelations) of each cluster, and the distortion D_I(M) is computed 
again. This process is iterated until it converges, i.e., until the percent 
change in distortion is less than a preset value ε (chosen to be 1 
percent in our simulations). Once convergence is achieved, M is 
doubled by splitting each code word into two. The entire process is 
repeated until M = M*. The iteration is initialized by choosing two 
arbitrary code words. 
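The binary-split procedure described above can be sketched as follows. This 
is a minimal illustration and not the implementation used in the 
experiments: squared Euclidean distance and the arithmetic mean stand in for 
the LPC distance and the autocorrelation-based centroid, the additive 
perturbation delta used to split a code word is an assumption (the text does 
not say how a code word is split), M* is assumed to be a power of two, and 
all function names are hypothetical. 

```python
def sq_euclidean(x, y):
    """Stand-in distance (assumption); the paper uses an LPC distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def classify(training, codebook, dist):
    """Assign each training vector to its closest code word.
    Returns the clusters and the average distortion of the assignment."""
    clusters = [[] for _ in codebook]
    total = 0.0
    for t in training:
        ds = [dist(t, r) for r in codebook]
        m = min(range(len(codebook)), key=lambda k: ds[k])
        clusters[m].append(t)
        total += ds[m]
    return clusters, total / len(training)


def centroid(vectors):
    """Arithmetic mean (stand-in for averaged normalized autocorrelations)."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def optimize(training, codebook, dist, eps=0.01):
    """Iterate classification and centroid updates until the relative
    change in average distortion falls below eps (1 percent by default)."""
    clusters, prev = classify(training, codebook, dist)
    while True:
        # Simplification: an empty cluster keeps its previous code word.
        codebook = [centroid(c) if c else r for c, r in zip(clusters, codebook)]
        clusters, cur = classify(training, codebook, dist)
        if prev == 0.0 or (prev - cur) / prev < eps:
            return codebook, cur
        prev = cur


def split(r, delta=1e-3):
    """Split one code word into two nearby code words (assumed rule)."""
    return [tuple(x + delta for x in r), tuple(x - delta for x in r)]


def binary_split_train(training, m_star, dist=sq_euclidean, eps=0.01):
    """Start from two arbitrary code words and double M until M = M*."""
    codebook = [training[0], training[-1]]
    codebook, distortion = optimize(training, codebook, dist, eps)
    while len(codebook) < m_star:
        codebook = [half for r in codebook for half in split(r)]
        codebook, distortion = optimize(training, codebook, dist, eps)
    return codebook, distortion


# Toy one-dimensional training set with four tight groups (made-up data).
training = [(-0.1,), (0.1,), (1.9,), (2.1,), (9.9,), (10.1,), (11.9,), (12.1,)]
codebook, distortion = binary_split_train(training, m_star=4)
print(sorted(codebook), distortion)
```

On this toy set the code words settle near 0, 2, 10, and 12. For real LPC 
data the distance and centroid functions would have to be replaced by the 
LPC distance and the autocorrelation-based centroid described in the text. 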

In our implementation, we made one modification to the VQ training 
algorithm of Fig. 1. We inserted a check after the classification of the 
training set vectors to see if any cluster is empty (i.e., contains none 
of the training set vectors). In such a case the "largest" cluster is split 
into two clusters, and the convergence test is bypassed (to ensure a 
reclassification in which each cluster is nonempty). However, for the 
data used in this experiment, an empty cluster never occurred. In 
subsequent tests with larger M* we did encounter such cases. 
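The empty-cluster check can be sketched as below. This is one possible 
reading of the modification, stated as an assumption: "largest" is taken to 
mean largest count, the empty cluster's code word is replaced by one half of 
the split, and the split uses a small additive perturbation delta. The 
function name repair_empty_clusters is hypothetical. 

```python
def repair_empty_clusters(clusters, codebook, delta=1e-3):
    """Replace the code word of every empty cluster by splitting the code
    word of the largest cluster (largest count is an assumption).

    Returns (new_codebook, repaired). When repaired is True the caller
    should skip the convergence test and reclassify the training set."""
    codebook = list(codebook)
    repaired = False
    for m, members in enumerate(clusters):
        if members:
            continue
        big = max(range(len(clusters)), key=lambda k: len(clusters[k]))
        r = codebook[big]
        codebook[big] = tuple(x + delta for x in r)  # one half of the split
        codebook[m] = tuple(x - delta for x in r)    # other half fills the empty slot
        repaired = True
    return codebook, repaired


# Illustrative case: the second cluster attracted no training vectors.
clusters = [[(0.0,), (1.0,), (2.0,)], [], [(9.0,)]]
codebook = [(1.0,), (50.0,), (9.0,)]
new_codebook, repaired = repair_empty_clusters(clusters, codebook)
print(new_codebook, repaired)
```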

For the single-split algorithm (Fig. 1b), only one modification is 
required. After convergence, only the "largest cluster" is split. Here 
largest can refer to the cluster with the largest average distortion, total 
distortion, or count. 

For a convergence criterion of ε = 1 percent, it typically takes three 
to six iterations of the classification loop to obtain a convergent set of 
clusters and centroids. We also found that the algorithms of Figs. 1a 
and 1b work extremely reliably over a broad range of types of training 
data (e.g., collected from a single talker, collected from many talkers, 
collected from a corpus of isolated words, collected from sentence- 
length material, etc.). 

III. COMPARISON OF THE BINARY- AND SINGLE-SPLIT ALGORITHMS 

To compare the performances of the binary- and single-split VQ 
training algorithms of Fig. 1, several tests were run. The database 
consisted of a set of 39,708 LPC vectors. The LPC analysis used a 
6.67-kHz sampling rate and an eighth-order analysis of 300-sample (45-ms) 
frames of speech. The sample frames had been preemphasized 
with a simple, first-order digital network (preemphasis factor of 0.95) 
and windowed by a 300-sample Hamming window. Frames were taken 
100 samples apart across the duration of each word of a series of 1000 
isolated words (digits) spoken by 100 talkers (50 male, 50 female). All 
recordings were made over dialed-up telephone lines through a local 
PBX connection. All silence outside the spoken words was eliminated 
by a word endpoint detector; 9 hence, all LPC training frames were 
from within word boundaries. 
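The framing described above can be sketched as follows. The constants come 
from the text; the exact form of the preemphasis network, 
y[n] = x[n] - 0.95 x[n-1], the order of operations (preemphasis of the whole 
word before framing), and the function names are assumptions made for 
illustration. 

```python
import math

SAMPLE_RATE_HZ = 6670.0  # 6.67-kHz sampling rate
FRAME_LEN = 300          # samples per frame (about 45 ms)
FRAME_SHIFT = 100        # samples between successive frame starts
PREEMPHASIS = 0.95       # preemphasis factor


def preemphasize(x, a=PREEMPHASIS):
    """Assumed first-order network y[n] = x[n] - a*x[n-1], with x[-1] = 0."""
    return [x[n] - a * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def hamming(n_points):
    """Hamming window of n_points samples."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def frames(word_samples):
    """Preemphasize one endpointed word and cut it into windowed frames."""
    y = preemphasize(word_samples)
    w = hamming(FRAME_LEN)
    out = []
    for start in range(0, len(y) - FRAME_LEN + 1, FRAME_SHIFT):
        out.append([y[start + n] * w[n] for n in range(FRAME_LEN)])
    return out


print(1000.0 * FRAME_LEN / SAMPLE_RATE_HZ)  # frame duration in ms, about 45
print(len(frames([0.0] * 500)))             # 3 frames from a 500-sample word
```

Each windowed frame would then go through the eighth-order LPC analysis to 
produce one training vector. 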

Several aspects of the binary- and single-split training algorithms 
were studied. The first question considered was whether the two 
training procedures yielded identical results (i.e., whether the resulting 
LPC code words and the clusters from which they were derived were 
identical). Figure 2 shows plots of the cluster splitting for an M* = 8 
solution for the binary-split algorithm (Fig. 2a) and the single-split 
algorithm based on average distance splitting (Fig. 2b). It can be seen 
that the resulting eight clusters in the single-split case come from very 
different splits than those for the binary-split case. For example, in 
the single-split case, final clusters 6 and 7 come from four splits of the 
original cluster 2, whereas final clusters 1 and 2 come from single 
splits of original clusters 1 and 2. In the binary-split case all final 
clusters come from two splits of original clusters 1 and 2. Similarly, 
the actual clusters were grossly different for the three different criteria 
for the single-split algorithm. 

Fig. 2. Splitting charts for an M* = 8 vector quantizer with splits based on 
average distortion. (a) The binary-split training algorithm. (b) The 
single-split algorithm. 

The next question we considered was how the different training 
procedures differed in performance. Figures 3 through 5 show a series 
of plots of statistics comparing some of the details of the individual 
training procedures. For each of these plots, Parts (a) through (d) 
show results for the binary-split case, the single-split case based on 
count, the single-split case based on average distortion, and the 
single-split case based on total distortion. The statistics plotted are ratio of 
maximum to minimum cluster count (Fig. 3), ratio of maximum to 
minimum average distortion (Fig. 4), and ratio of maximum to minimum 
total distortion (Fig. 5) versus size of the vector quantizer. These 
statistics were chosen because each of them should ideally approach 
1.0 for clusters that are of equal size according to the corresponding 
splitting criterion. For example, we would expect the count ratio to 
approach 1.0 for the split on count criterion but not necessarily for 
the other splitting criteria. 

Fig. 3. Plots of count ratio (maximum cluster count divided by minimum cluster 
count) as a function of the size of the vector quantizer. (a) Binary-split 
training. (b) Single-split training based on count. (c) Single-split training 
based on average distortion. (d) Single-split training based on total 
distortion. 

Fig. 4. Plots of average-distortion ratio as a function of the size of the 
vector quantizer. (a) Binary-split training. (b) Single-split training based 
on count. (c) Single-split training based on average distortion. (d) 
Single-split training based on total distortion. 
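The three ratio statistics can be computed from the per-cluster counts and 
average distortions, as in the sketch below. The function names and the 
example numbers are hypothetical; the numbers are chosen so that the clusters 
are balanced in total distortion but not in count or average distortion. 

```python
def max_min_ratio(values):
    """Ratio of the largest to the smallest value over all clusters."""
    return max(values) / min(values)


def cluster_ratios(counts, avg_distortions):
    """Count ratio, average-distortion ratio, and total-distortion ratio
    for one vector quantizer, given C_M(m) and d_M(m) for each cluster."""
    totals = [c * d for c, d in zip(counts, avg_distortions)]  # D_M(m)
    return {
        "count": max_min_ratio(counts),
        "average": max_min_ratio(avg_distortions),
        "total": max_min_ratio(totals),
    }


# Made-up statistics for M = 3 clusters.
counts = [400, 100, 200]
avg_distortions = [0.1, 0.4, 0.2]
print(cluster_ratios(counts, avg_distortions))
```

Here the count and average-distortion ratios are both about 4, while the 
total-distortion ratio is about 1, the ideal value for a quantizer whose 
clusters are equalized by total distortion. 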

Examination of Figs. 3 through 5 shows several interesting things. 
As seen in Fig. 3, the count ratio for the binary-split case for M* = 64 
(4.1) is actually smaller than the count ratio for the single split on 
count case for M* = 64 (4.8). The count ratios for the other two split 
criteria are indeed larger than for the split on count, as expected. 
Figure 4 shows that the average-distortion ratio is smallest (4.1) at 
M* = 64 for the single split on average-distortion case; however, the 
distortion ratios for the binary case (4.4) and the single split on 
total-distortion (4.7) cases are only slightly larger. Finally, Fig. 5 shows a 
similar set of results on the total-distortion-ratio statistic in which 
the results for M* = 64 for the binary-split case (2.7) are only slightly 
worse than for the single split on total-distortion case (2.6). 

Fig. 5. Plots of total-distortion ratio as a function of the size of the 
vector quantizer. (a) Binary-split training. (b) Single-split training based 
on count. (c) Single-split training based on average distortion. (d) 
Single-split training based on total distortion. 

The results of Figs. 3 through 5 indicate that the binary-split case 
seems to yield cluster training statistics that are almost as good as the 
best statistics for any of the single-split cases in terms of count ratio, 
average-distortion ratio, and total-distortion ratio. Hence, from the 
point of view of cluster statistics, the binary-split cases appear to give 
the best overall performance. 






Two gross performance checks were made on the training algorithms. 
In the first test, the average distance between vector quantizer 
sets obtained from the different training procedures was calculated as 
a function of M*. The results of this test are given in Table I. It can 
be seen that the average distance between vector quantizer sets is as 
small as or smaller than the average distance of the training vectors to 
the code book sets. Hence, the code book sets derived from the different 
training algorithms are, on average, quite close to each other. 

Table I. Average distance between code book entries of vector 
quantizers designed on the basis of count (R_c), average distortion 
(R_d), total distortion (R_D), and binary splitting (R_B) 

M*    d(R_c,R_d)   d(R_c,R_D)   d(R_c,R_B)   d(R_d,R_D)   D_I(M*)†
 4      0.384        0.019        0.047        0.270       0.707
 8      0.125        0.138        0.157        0.101       0.426
16      0.148        0.143        0.160        0.065       0.326
32      0.191        0.108        0.175        0.132       0.255
64      0.216        0.131        0.148        0.131       0.203

† Average distance between the training vectors and the code words representing 
them. 

Fig. 6. Plot of average training set distortion D_I(M*) as a function of the 
size of the vector quantizer (M* on a log scale). 
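One way to compute an average distance between two code books is sketched 
below. The text does not spell out the definition used for Table I, so the 
one here is an assumption: each entry of the first code book is matched to 
its nearest entry in the second, and the resulting distances are averaged. 
Squared Euclidean distance again stands in for the LPC distance, and the 
function names are hypothetical. 

```python
def sq_euclidean(x, y):
    """Stand-in distance (assumption); the paper uses an LPC distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def codebook_distance(book_a, book_b, dist=sq_euclidean):
    """Average, over the entries of book_a, of the distance to the nearest
    entry of book_b. Note that this measure is not symmetric in general."""
    return sum(min(dist(a, b) for b in book_b) for a in book_a) / len(book_a)


# Two made-up code books of size 2.
book_a = [(0.0, 0.0), (4.0, 0.0)]
book_b = [(0.0, 1.0), (4.0, 0.0)]
print(codebook_distance(book_a, book_b))  # 0.5
```

Comparing such a value with the average training set distortion of either 
code book gives the kind of check reported in Table I. 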

The second test we performed was to measure the average distortion, 
D_I(M*), versus M* for the different training algorithms for values of 
M* from 2 to 64. The results of this test are plotted in Fig. 6. On the 
scale of this plot, the differences in average distortion are 
indistinguishable among the different vector quantizers. 

Fig. 7. Plots of code-word coverage in the F1-F2, F1-F3, and F2-F3 planes for 
an M* = 64 vector quantizer. 

The third and final question we considered concerns the coverage 
of the space of speech sounds by the optimum code books. A good way 
of displaying this coverage is to look at the code books in the space of 
formant frequencies. The formant frequencies (and bandwidths) for 
each entry of the code book are given by the zeros of the trigonometric 
polynomial associated with it. Thus each code book may be displayed 
as a scatter plot in F1-F2-F3 space. Projections of this scatter diagram 
on the F1-F2, F1-F3, and F2-F3 planes are shown in Figs. 7 and 8 for the 
code books obtained from the binary-split training algorithm for 
M* = 64 (Fig. 7) and M* = 1024 (Fig. 8). It is 
seen that the code words cover the expected regions in the formant 
frequency planes fairly uniformly. The major difference between the 
coverage of the M* = 1024 and the M* = 64 code books is the density 
of coverage of the areas in the respective formant frequency planes. 
The coverage of the single-split algorithms for M* = 64 was essentially 
identical to that of the binary-split algorithm. 

Fig. 8. Plots of code-word coverage in the F1-F2, F1-F3, and F2-F3 planes for 
an M* = 1024 vector quantizer. 

IV. DISCUSSION 

Our overall conclusion from the tests that compared the fine and 
gross differences in clustering LPC vectors via a VQ training algorithm 
is that all the variations in the training procedure that we studied (i.e., 
different splitting procedures, different convergence criteria, etc.) lead 
to essentially indistinguishable sets of VQ code book entries. Since the 
binary-split algorithm, as discussed by Linde et al., 1 requires the least 
amount of computation, it is the best of the algorithms considered. 

In this paper we present the results of a series of experiments on a 
training set of 39,708 vectors. More recently we have experimented 
with the binary-split VQ training procedure on a number of different 
training sets whose size varied from 10,000 to 600,000 vectors. We 
found that the training procedure always rapidly and reliably con- 
verged to a set of code book vectors whose properties were similar to 
those described in this paper. We are currently using the VQ code 
book sets in work related to speech recognition and speech coding. 